226 research outputs found

    The frequency spectrum of finite samples from the intermittent silence process

    It has been argued that the actual distribution of word frequencies could be reproduced or explained by generating a random sequence of letters and spaces according to the so-called intermittent silence process. The same kind of process could reproduce or explain the counts of other kinds of units from a wide range of disciplines. Taking the linguistic metaphor, we focus on the frequency spectrum, i.e., the number of words with a certain frequency, and the vocabulary size, i.e., the number of different words of text generated by an intermittent silence process. We derive and explain how to calculate accurately and efficiently the expected frequency spectrum and the expected vocabulary size as a function of the text size.
    Peer Reviewed. Postprint (author's final draft)
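The process described in this abstract can be illustrated with a small simulation. The sketch below is only an assumption-laden illustration, not the paper's derivation: it draws each character independently, uses equal letter probabilities, and measures the empirical frequency spectrum and vocabulary size of one random sample rather than their expected values. The function names, the two-letter alphabet, and the space probability are hypothetical choices.

```python
import random
from collections import Counter


def intermittent_silence_text(n_chars, alphabet="ab", p_space=0.2, seed=0):
    """Generate n_chars characters: a space with probability p_space,
    otherwise a letter drawn uniformly from alphabet (an assumption)."""
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    )


def frequency_spectrum(text):
    """Return (number of word tokens, vocabulary size, spectrum), where
    spectrum[f] is the number of distinct words occurring exactly f times."""
    words = text.split()                 # runs of spaces produce no empty words
    word_counts = Counter(words)         # word -> frequency
    spectrum = Counter(word_counts.values())  # frequency -> number of words
    return len(words), len(word_counts), dict(spectrum)


if __name__ == "__main__":
    sample = intermittent_silence_text(10_000)
    n_tokens, vocabulary_size, spectrum = frequency_spectrum(sample)
    print(n_tokens, vocabulary_size)
    print(sorted(spectrum.items())[:5])
```

Two identities make a quick sanity check: the spectrum values sum to the vocabulary size, and the sum of frequency times count over the spectrum equals the number of word tokens.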

    Clustering Patients with Tensor Decomposition

    In this paper we present a method for the unsupervised clustering of high-dimensional binary data, with a special focus on electronic healthcare records. We present a robust and efficient heuristic to face this problem using tensor decomposition. We explain why this approach is preferable to the more commonly used distance-based methods for tasks such as clustering patient records. We run the algorithm on two datasets of healthcare records, obtaining clinically meaningful results.
    Comment: Presented at 2017 Machine Learning for Healthcare Conference (MLHC 2017). Boston, M

    Discontinuities in recurrent neural networks

    This paper studies the computational power of various discontinuous real computational models that are based on the classical analog recurrent neural network (ARNN). This ARNN consists of a finite number of neurons; each neuron computes a polynomial net-function and a sigmoid-like continuous activation-function. The authors introduce
    Postprint (published version)

    Learning probability distributions generated by finite-state machines

    We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference-in-the-limit and PAC formal models. The methods we review are state merging and state splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.
    Peer Reviewed. Postprint (author's final draft)
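To make the Hankel matrix mentioned in this abstract concrete: its entry for a prefix u and a suffix v is the probability of the concatenated string uv. The sketch below fills a finite block of that matrix by querying an oracle. The oracle is a hypothetical one-state automaton chosen for illustration, and the function names are assumptions; this is not the review's learning algorithm, only the object it is phrased in terms of.

```python
def geometric_oracle(word, stop=0.5):
    """Hypothetical oracle: probability of `word` under a one-state automaton
    over {a, b} that stops with probability `stop` and otherwise emits
    'a' or 'b' with equal probability."""
    emit = (1.0 - stop) / 2.0
    return stop * emit ** len(word)


def hankel_block(oracle, prefixes, suffixes):
    """Finite block of the Hankel matrix: entry [i][j] = oracle(prefixes[i] + suffixes[j])."""
    return [[oracle(u + v) for v in suffixes] for u in prefixes]


if __name__ == "__main__":
    basis = ["", "a", "b"]
    for row in hankel_block(geometric_oracle, basis, basis):
        print(row)
```

Because the assumed oracle comes from a single state, every row of the block is a multiple of the first row, so the block has rank one. In general the rank of the Hankel matrix is bounded by the number of states of an automaton generating the distribution.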

    Identifiability and transportability in dynamic causal networks

    In this paper we propose a causal analog to the purely observational Dynamic Bayesian Networks, which we call Dynamic Causal Networks. We provide a sound and complete algorithm for identification of Dynamic Causal Networks, namely, for computing the effect of an intervention or experiment, based on passive observations only, whenever possible. We note the existence of two types of confounder variables that affect the identification procedures in substantially different ways, a distinction with no analog in either Dynamic Bayesian Networks or standard causal graphs. We further propose a procedure for the transportability of causal effects in Dynamic Causal Network settings, where the result of causal experiments in a source domain may be used for the identification of causal effects in a target domain.
    Preprint

    Machine learning assists the classification of reports by citizens on disease-carrying mosquitoes

    Mosquito Alert (www.mosquitoalert.com/en) is an expert-validated citizen science platform for tracking and controlling disease-carrying mosquitoes. Citizens download a free app and use their phones to send reports of presumed sightings of two world-wide disease vector mosquito species (the Asian Tiger and the Yellow Fever mosquito). These reports are then supervised by a team of entomologists and, once validated, added to a database. As the platform prepares to scale to much larger geographical areas and user bases, the expert validation by entomologists becomes the main bottleneck. In this paper we describe the use of machine learning on the citizen reports to automatically validate a fraction of them, therefore allowing the entomologists either to deal with larger report streams or to concentrate on those that are more strategic, such as reports from new areas (so that early warning protocols are activated) or from areas with high epidemiological risks (so that control actions to reduce mosquito populations are activated). The current prototype flags a third of the reports as “almost certainly positive” with high confidence. It is currently being integrated into the main workflow of the Mosquito Alert platform.
    Postprint (published version)

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
    Postprint (published version)
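For readers unfamiliar with the term, an itemset is frequent if it appears in at least a minimum number of transactions, and closed if no proper superset has the same support. The sketch below computes closed frequent itemsets by brute force on a small batch of transactions. It is deliberately not IncMine and not a streaming algorithm; it only illustrates the kind of output such a miner produces. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def closed_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all closed frequent itemsets.
    Brute-force batch enumeration, exponential in the number of items;
    suitable for toy inputs only."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Support = number of transactions that contain the candidate itemset.
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            support = sum(candidate <= t for t in transactions)
            if support >= min_support:
                frequent[candidate] = support

    # Closed = no proper superset with the same support. Any such superset
    # would itself be frequent, so searching within `frequent` is enough.
    return {
        itemset: support
        for itemset, support in frequent.items()
        if not any(
            itemset < other and support == other_support
            for other, other_support in frequent.items()
        )
    }


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
    for itemset, support in closed_frequent_itemsets(data, 2).items():
        print(sorted(itemset), support)
```

In the toy data, {b} is frequent but not closed, because {a, b} occurs in exactly the same two transactions. Reporting only closed itemsets loses no support information while keeping the output smaller.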

    Applying trust metrics based on user interactions to recommendation in social networks

    Recommender systems have been researched intensively over the last decade. With the rise and popularization of digital social networks, a new field has opened up for social recommendations. Considering the network topology or user interactions, or estimating trust between users, are some of the new strategies that recommender systems can take into account in order to adapt their techniques to these new scenarios. We introduce MarkovTrust, a way to infer trust from Twitter interactions and to compute trust between distant users. MarkovTrust is based on Markov chains, which makes it simple to implement and computationally efficient. We study the properties of this trust metric and its application in a recommender system of tweets.
    Postprint (published version)

    Assessing spatiotemporal correlations from data for short-term traffic prediction using multi-task learning

    Traffic flow prediction is a fundamental problem for efficient transportation control and management. However, most current data-driven traffic prediction work found in the literature has focused on predicting traffic from an individual task perspective, and has not fully leveraged the implicit knowledge present in a road network through space and time correlations. Such correlations are now far easier to isolate due to the recent profusion of traffic data sources and, more specifically, their wide geographic spread. In this paper, we take a multi-task learning (MTL) approach whose fundamental aim is to improve the generalization performance by leveraging the domain-specific information contained in related tasks that are jointly learned. In addition, another common factor found in the literature is that a historical dataset is used for the calibration and the assessment of the proposed approach, without dealing in any explicit or implicit way with the frequent challenges found in real-time prediction. In contrast, we adopt a different approach which treats this problem from the point of view of data streams, and thus the learning procedure is undertaken online, giving greater importance to the most recent data, making data-driven decisions online, and undoing decisions which are no longer optimal. In the experiments presented we achieve more compact and consistent knowledge in the form of rules automatically extracted from data, while maintaining or even improving, in some cases, the performance over single-task learning (STL).
    Peer Reviewed. Postprint (published version)